Conversation

Qubitium commented Oct 13, 2025

On first launch, a Triton kernel calls `_bench` and related code to cache the best config for subsequent kernel launches. This part of the code is racy and will crash under GIL=0 (Python >= 3.13T). The crash manifests as cryptic `'NoneType' object has no attribute ...` errors to the end user, when in reality it is a data-race issue.
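The failure mode is the familiar unsynchronized check-then-populate pattern; here is a minimal sketch of the shape of the problem (an illustration only, not Triton's actual autotuner code):

```python
# Simplified illustration of the race, NOT the actual Autotuner code.
class RacyAutotuner:
    def __init__(self):
        self.cache = {}  # shared across threads, unprotected

    def run(self, key, configs, bench):
        # Classic check-then-populate: with GIL=0, two threads can both miss,
        # both benchmark, and interleave writes to shared autotuner state,
        # surfacing as confusing attribute errors rather than a clean failure.
        if key not in self.cache:
            best = min(configs, key=bench)
            self.cache[key] = best
        return self.cache[key]
```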

I encountered this while doing GIL=0 thread-based module-parallel execution across multiple GPUs in GPT-QModel, a quantization toolkit that ships Triton kernels.

EDIT: Please ignore the following block. The PR has been updated to use a global lock plus per-key futures to synchronize multi-threaded autotuning.

------ Outdated ------

There were two possible protections: locks or thread-local storage. I implemented thread-local for the following reasons:

  1. Latency: thread-local (no locks) is much faster for cache retrieval than a lock-protected global cache.
  2. Locks are messy, and thread-local is actually easier to implement with less code.
  3. Thread-local has an obvious downside: no persistence across threads and duplication of Triton configs, plus extra autotune runs per GPU thread. I consider this a small downside with clear upsides.

Upsides:

  1. If you use Python GIL=0 and care about performance, you will likely use thread pools rather than launching a new thread every time. This nullifies the persistence downside in real-world applications. Launching a new thread for every kernel execution is bad practice (thread startup is far slower than any code or lock overhead by a wide margin), and we should not spend too much effort optimizing for bad practices.

  2. If you want to further optimize your threading code, you will likely use persistent threads and bind each thread to a specific cuda:index. With a global cache, Triton would assume all cuda:index devices are the same GPU. That is not the case for my setup, and we shouldn't make this assumption for others. Frankly, the benchmark/config cache should be keyed per cuda:index (unique GPU fingerprint), but that is a topic for another PR. With thread-local storage we effectively get an optimized kernel launch per cuda:index (potentially a unique GPU), since the developer should already persist a thread in a pool and bind it to a cuda:index.

As such, I believe thread-local is a better option than a global cache (with locks): more scalable, lower latency, and indirectly more accurate. I'm unsure how much memory the duplicated configs would cost, but I expect it to be minimal. A minimal sketch of this approach is shown below.
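For illustration only, a sketch of the (now abandoned) thread-local variant; `get_best_config` is a hypothetical helper, not Triton's API:

```python
# Sketch of the abandoned thread-local approach (illustration only).
import threading

_tls = threading.local()

def get_best_config(key, configs, bench):
    # Each thread keeps its own config cache, so no lock is needed,
    # at the cost of re-running autotuning once per thread.
    cache = getattr(_tls, "cache", None)
    if cache is None:
        cache = _tls.cache = {}
    if key not in cache:
        cache[key] = min(configs, key=bench)  # per-thread benchmark
    return cache[key]
```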

A new test_autotuner_thread_safety unit test has been added to the existing python/test/unit/runtime/test_autotuner.py test file. The test should run in a GIL=0 environment; Triton CI should set up a Python 3.14T runner for it. A sketch of what such a test can look like is shown below.
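For illustration, a concurrency test along these lines exercises the fix. This is a sketch only; `_add_kernel` and `_launch` are hypothetical names and this is not the exact test added in the PR:

```python
# Sketch only: not the exact test added in the PR.
from concurrent.futures import ThreadPoolExecutor

import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
    ],
    key=["n_elements"],
)
@triton.jit
def _add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)


def _launch(n=4096):
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
    _add_kernel[grid](x, y, out, n)
    torch.testing.assert_close(out, x + y)


def test_autotuner_thread_safety():
    # On the first launches, many threads race into autotuning of the same
    # kernel/key; without synchronization this crashes under GIL=0.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(lambda _: _launch(), range(32)))
```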

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • Add new test_autotuner_thread_safety inside existing python/test/unit/runtime/test_autotuner.py
  • Select one of the following.

    • I have not added any lit tests.

@Qubitium Qubitium requested a review from ptillet as a code owner October 13, 2025 14:58
@lezcano lezcano requested a review from peterbell10 October 14, 2025 07:47
Jokeren (Contributor) commented Oct 14, 2025

I probably didn't get the use of thread local here. Doesn't it mean that we need to autotune T times if there are T concurrent threads?

Jokeren (Contributor) commented Oct 14, 2025

Also, if two threads happen to use the same GPU, the measured performance won't match the original case where a single thread uses the GPU, since kernels from each thread will be launched and executed in an interleaved way.

Qubitium (Author) commented Oct 14, 2025

> Also, if two threads happen to use the same GPU, the measured performance won't match the original case where a single thread uses the GPU, since kernels from each thread will be launched and executed in an interleaved way.

Having a global cache or lock doesn't help that situation. Suppose two threads are bound to gpu:0: one is already running pre-tuned kernel work while the second, also bound to gpu:0, enters autotuning for the first time on a different kernel. I don't think we can control this. It is up to the end user to launch new kernels serially and hold exclusive locks per gpu:index to make sure the autotune benchmarks are consistent and accurate.

Qubitium (Author) commented Oct 14, 2025

> I probably didn't get the use of thread local here. Doesn't it mean that we need to autotune T times if there are T concurrent threads?

Yes, this is the downside of thread-local. For most cases, I think end users will launch a thread and have it persist for the lifetime of the model execution instead of launching a fresh thread each time they need to execute a kernel; the cost of launching threads and context switching is quite high. Maybe my view of how threads should be used differs from reality, but in my view users should not be tearing GPU-bound execution threads up and down, just as you don't repeatedly launch and tear down GPU processes: you keep them alive for the duration of the program. A sketch of that persistent per-device worker pattern is below.
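To make that concrete, here is a minimal sketch of the usage pattern I have in mind (my own illustration, not code from this PR): persistent workers, each pinned to one cuda:index.

```python
# Sketch of persistent per-GPU worker threads (illustration only).
import queue
import threading

import torch


def _worker(device_index, jobs):
    # Bind this thread to one GPU; CUDA's current device is per-thread.
    torch.cuda.set_device(device_index)
    while True:
        job = jobs.get()
        if job is None:  # shutdown sentinel
            break
        job()  # e.g. a Triton kernel launch; autotuned once per worker lifetime


queues = []
for idx in range(torch.cuda.device_count()):
    q = queue.Queue()
    threading.Thread(target=_worker, args=(idx, q), daemon=True).start()
    queues.append(q)

# Submit work to the thread bound to cuda:0; the thread (and its autotune
# results) lives for the duration of the program.
queues[0].put(lambda: None)  # placeholder job
```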

Qubitium (Author) commented

@Jokeren I think global cache and thread-local cache both have advantages and disadvantages. In my world (GIL=0), thread-local makes more sense, but since this PR is for everyone, I don't mind switching back to a global cache with a threading.Lock if more users would benefit from the global-cache use case.

Jokeren (Contributor) commented Oct 14, 2025

I believe that a thread-local implementation may introduce additional inconsistencies. For instance, different threads could end up selecting different "best" kernels due to subtle measurement nuances.

Qubitium (Author) commented Oct 15, 2025

> I believe that a thread-local implementation may introduce additional inconsistencies. For instance, different threads could end up selecting different "best" kernels due to subtle measurement nuances.

@Jokeren Please re-review. Here are the changes since your last review:

  1. Global cache (reverted thread-local).
  2. New global cache lock, _cache_lock (reentrant). cache has been renamed to _cache since it is lock-protected and dangerous to touch directly, making it private.
  3. To allow multi-threaded concurrent autotuning of the same or different kernels, a CacheFuture keyed by the autotuner key is created and stored in _cache_futures.
  4. When a thread enters autotuning and the cache lookup misses, it creates a CacheFuture for that key as a sync mechanism to block other threads that come in with the same kernel/key. The futures dict is protected by the same _cache_lock (see the sketch after this list).
  5. The disk cache appears safe, since the cache file is also keyed internally (different files for different keys) and is now protected by the same per-key futures.
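A minimal sketch of the lock-plus-per-key-future pattern described above; this shows the shape of the approach, not the PR's exact code, and `get_best_config`/`tune` are hypothetical names:

```python
# Sketch of the global-lock + per-key-future synchronization (illustration only).
import threading


class CacheFuture:
    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def set(self, value):
        self.result = value
        self._done.set()

    def wait(self):
        self._done.wait()
        return self.result


_cache = {}
_cache_futures = {}
_cache_lock = threading.RLock()  # reentrant, guards both dicts


def get_best_config(key, tune):
    with _cache_lock:
        if key in _cache:                 # fast path: already tuned
            return _cache[key]
        future = _cache_futures.get(key)
        if future is None:                # first thread for this key tunes it
            future = _cache_futures[key] = CacheFuture()
            owner = True
        else:                             # later threads wait on the future
            owner = False
    if owner:
        best = tune()                     # benchmark outside the lock
        with _cache_lock:
            _cache[key] = best
            _cache_futures.pop(key, None)
        future.set(best)
        return best
    return future.wait()
```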

Jokeren (Contributor) commented Oct 15, 2025

@codex review

chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Jokeren (Contributor) commented Oct 16, 2025

Let's keep the PR open until we have figured out more complete solutions for GIL=0.

Qubitium (Author) commented Oct 17, 2025

> Let's keep the PR open until we have figured out more complete solutions for GIL=0.

Yes. This PR only fixes the autotuner and does not address the wider issues that may exist in other parts of Triton. Are there plans, short or long term, to address those outside the scope of this PR? It would be good to piecemeal the nogil compatibility work: there are lots of APIs, including Python stdlib ones that are not safe under nogil (like gc.collect), and it would be hard to fix everything in a single PR.

Jokeren (Contributor) commented Oct 17, 2025

> Are there plans, short or long term, to address those outside the scope of this PR?

Not yet. So I think we don't plan to merge this PR in the short term. cc @ThomasRaoux

Qubitium (Author) commented

> Are there plans, short or long term, to address those outside the scope of this PR?

> Not yet. So I think we don't plan to merge this PR in the short term. cc @ThomasRaoux

Ok. I am hoping this high-impact, low-risk, low-hanging-fruit PR will get consideration before a full thread-safe Triton roadmap. I am also willing to spend more time on this PR to address any lingering concerns about accuracy, safety, or execution-pattern deviation from main under GIL and no-GIL.
